Extending density functional theory with near chemical accuracy beyond pure water

Density functional simulations of condensed phase water are typically inaccurate, due to the inaccuracies of approximate functionals. A recent breakthrough showed that the SCAN approximation can yield chemical accuracy for pure water in all its phases, but only when its density is corrected. This is a crucial step toward first-principles biosimulations. However, weak dispersion forces are ubiquitous and play a key role in noncovalent interactions among biomolecules, but are not included in the new approach. Moreover, naïve inclusion of dispersion in HF-SCAN ruins its high accuracy for pure water. Here we show that systematic application of the principles of density-corrected DFT yields a functional (HF-r2SCAN-DC4) which recovers and not only improves over HF-SCAN for pure water, but also captures vital noncovalent interactions in biomolecules, making it suitable for simulations of solutions.

implying that approximate KS densities are very good as measured by the impact on energies. A very simply example for such case is the total energy of the helium atom, whose energy computed from the PBE functional [3] barely changes after the PBE density is replaced with the exact one. [2] However, for a significant number of chemical domains (e.g., anions and barrier heights), ∆ED can be much larger than ∆EF [4,5]. A good example for such a case is H − with the PBE functional, which gives excellent energies when evaluated on the exact densities, whereas the self-consistent PBE density cannot even bind two electrons for this anion. * esim@yonsei.ac.kr

Supplementary Note 2. Spotting and curing large density-driven errors
The following two questions are of key importance in DC-DFT: (i) How do we spot cases where ∆ED is large?; (ii) What do we do in such cases to reduce large density-driven errors? With access to exact densities we can easily answer (i), as with those we can easily measure ∆ED by Eq. 3. At the same time, we answered (ii), as ∆ED entirely vanishes with exact densities. However, exact densities are available only for small systems [2,6], and if we always had access to exact energies and densities from highly accurate wavefunction theories, we would not even bother with KS-DFT. Thus one needs to find more practical ways to answer (i) and (ii). In relation to (i), the following quantity has been introduced: and is called density sensitivity.S requires two nonempirical densities: the HF densities which are typically overlocalized and the local density approximation (LDA) densities which are typically delocalized.S serves as a practical measure of density sensitivity of a given reaction and approximate functional. For small molecules, S greater than the heuristic cutoff of 2 kcal/mol implies density sensitivity, indicating that the calculation may suffer from a large ∆ED. What should we do then to reduce large ∆ED? In these cases, evaluating an approximate functional on the HF in place of self-consistent densities will likely reduce ∆ED and likely improve the functional's performance. This procedure, called HF-DFT, is the practice of evaluating an XC approximation on the HF density and orbitals. It had been used a long before DC-DFT was proposed [7,8,9,10,11] but only because HF densities were more convenient than self-consistent densities. KS-DFT does not always benefit from HF densities (e.g, cases where ∆ED is small) and in Refs. [12,13,14] we discuss in more details formal and practical (dis)advantages of HF-DFT over SC-DFT.

Supplementary Note 3. HF-DFT and related DC-DFT procedures
Following the idea that in some cases HF-DFT works better and SC-DFT in others, DC(HF)-DFT has been proposed [13,14]. It is a procedure that discriminately uses HF densities, as DC(HF)-DFT becomes HF-DFT for cases that are both density-sensitive (S above a given cut-off value, 2 kcal/mol as discussed in Ref. [6].) and whose HF solution is not severely spin-contaminated. Otherwise, DC(HF)-DFT reverts to SC-DFT. Since HF-DFT uses the HF density as a proxy for the exact density, we only use it when there is little or no spin contamination. We calculate the expectation values of the spin-squared operator, S 2 , and only use the HF density if the <S 2 > from the HF calculation deviates less than 10% from the exact <S 2 > as discussed in Refs. [13] and [15]. Otherwise, we use the self-consistent density. As it combines the best of both of them, DC(HF)-DFT comes with a range of advantages over both HF-DFT and SC-DFT as further detailed in Ref. [13]. These advantages come at a small extra cost, as for DC(HF)-DFT we need to run up to three distinct self-consistent cycles to obtain the three densities (the SC density for a given functional and those from HF and LDA needed to calculateS). While we consider DC(HF)-DFT the stateof-the-art DC-DFT-based procedure, in the present work we want a simple DC framework that can be applied easily and routinely. r 2 SCAN and SCAN are general-purpose functionals and the ease of their use would be undermined if tandem with DC(HF)-DFT, which would require always calculatingS and possibly making adjustments to its cut-off value. For this reason and encouraged by the very good performance of HF-DFT with SCAN-like functionals [16,17], we employ HF-DFT as a DC-DFT procedure throughout this work. As said, constructing the robust and accurate HF-r 2 SCAN-DC4 is the central objective of this work. While the resulting HF-r 2 SCAN-DC4 can be routinely used by applying it to HF orbitals without ever needing to calculateS of a given reaction, the use ofS is vital for our training of HF-r 2 SCAN-DC4. Specifically, we use density-sensitivities of the training reactions to fit the D4 part of HF-r 2 SCAN. Further technical details of this fitting procedure will be given in the next section.

Supplementary Note 4. Optimizing dispersion parameters
D4 stands for the generally applicable atomic-charge dependent London dispersion correction term developed by Grimme and coworkers. [18]. It has 4 functional-dependent parameters s 6 , s 8 , a 1 , and a 2 . Following Refs. [19] and [18], we set s 6 to unity as is common for functionals that do not capture long-range dispersion interactions. We optimized the s 8 , a 1 , and a 2 parameters by minimizing the mean absolute error (MAE) for the density-insensitive GMTKN55 reactions by following the DC-DFT ideas of Ref. [12]. However, the density-insensitive reactions in GMTKN55 largely fall into two distinct parameter groups for HF-r 2 SCAN: s 8 has a negative value for noncovalent interactions, but is positive for the rest. The difference in MAE of density-insensitive cases between those two groups is miniscule (below 0.01 kcal/mol). For example, (s 8 ,a 1 ,a 2 )=(-0.20,0.07,6.50) gives 1.209 kcal/mol for the densityinsensitive MAE while (0.39,0.09,7.02) gives 1.210 kcal/mol. Such a difference is not meaningful. Small changes in computational details such as DFT grid information, two-electron operator fitting scheme, etc. changes the values of the parameters, since reaction energy errors and density sensitivity values can be changed by 0.01 kcal/mol with those changes. To eliminate this ambiguity while ensuring accuracy in water interactions, we include the densityinsensitive water· · · water pair interaction energy as a validation set. The two most stable water hexamers, the prism and the cage, are used to calculate the water· · · water 2-body interaction energy error per dimer, relative to CCSD(T)/CBS in Ref. [20]. We multiply its weight by 7 in our loss function to produce a better defined minimum and regularize the result (if we used 1, it has no effect; if we used 1000, we simply fit to this data). We can rationalize this value by noting that the mean density sensitivity of these pairs is 0.27 kcal/mol, which is about 1/7th of our density sensitivity cutoff. The resulting values for the three parameters are: -0.36, 0.23, 5.23 for s 8 , a 1 , and a 2 each.

Supplementary Note 5. Additional results for the GMTKN55 database
In Table 1, we list MAE (kcal/mol) of SC-r 2 SCAN and HF-r 2 SCAN functional with and without the dispersion correction for the chemically diverse GMTKN55 database [21]. def2-QZVPPD basis set is used.
Supplementary Note 6. Additional results for water clusters i. MD generated dimer structures The structures used in Figures 1(e) and 2(b), have been obtained from the molecular dynamics (MD) simulation. The simulation is performed within the XTB package [22] and the GFN-FF force field [23], enabling us to generate the various dimer configurations. The total simulation time is 50 ps, while the integration time step is 4.0 fs using a Berendsen thermostat at 298K in the NVT ensemble. We use the SHAKE algorithm to constrain bonds, for all bonds with 4 amu for the hydrogen atom mass. Then, we randomly selected 110 different configurations for the water· · · water dimer and 80 for the water· · · Aspirin dimers. The reference interaction energies are then calculated with DLPNO-CCSD(T)-F12/TightPNO method with the aug-cc-pvqz basis set for water· · · water dimers and aug-cc-pvtz basis set for water· · · Aspirin dimers.

ii. Many-body expansion of the interaction energy
The interaction energy can be decomposed into 2-body, 3-body, etc. by using the many-body expansion. [24] For example, the interaction energy of the water hexamer can be divided into Kbody contributions, where E K−body int is the K-body interaction energy which can be calculated from the total energy of the subcluster of the N -mer cluster: [24] (6) where S i tot stands for the total energy summation of the i-th monomer subcluster.

Supplementary Note 7. Additional results for the complexes with cytosine
In Figure 4, we plot the cytosine· · · water and cytosine· · · cytosine interaction energy error plots, respectively. The 14 different cytosine· · · cytosine configurations and reference interaction energies are from Ref. [25]. For the cytosine· · · water interaction, we place two water molecules around 14 different cytosine· · · cytosine structures and optimized the water molecular coordinates while fixing the cytosine· · · cytosine coordinates. B3LYP functional is used for the geometry optimization. DLPNO-CCSD(T)-F12/augcc-pvqz with TightPNO is used as a reference cytosine· · · water interaction energy with the ORCA package. [26] Supplementary Note 8.

Interactions including water
For the calculations shown in Figure 5(a), we combined energies of 145 reactions involving water. They include: the water hexamer isomerization in Figure 1(b), the water binding energy of WATER27 dataset in Figure 1(c), the water 20-mer isomerization in Figure 1(d), the water dimer stationary point geometry interaction energy in Figure 2(a), and water· · · small organic molecule interaction energy in Figure 11.  Supplementary Figure 8: Aspirin· · · water interaction including SC-r 2 SCAN-D4. The x-axis is the oxygen· · · oxygen distance between oxygen in the water and the specified oxygen in Aspirin (see inset of Fig .1(e)). aug-cc-pvtz basis set is used. (right) theirS value calculated from the r 2 SCAN functional. Geometries are from Ref. [20]. MAEs of each functional are 0.61, 0.04, 0.13, 0.36, and 0.43 kcal/mol compared to the CCSD(T)/CBS from Ref. [28]. The ordering is the same as the legend ordering. For HF-r 2 SCAN-DC4, we used the dispersion parameters from Section Supplementary Note 4, whilst for SC-r 2 SCAN-D4, we took the D4 parameters from Ref. [27]. The aug-cc-pvqz basis set was used for the calculations.
Supplementary Figure 10: The hexagon plot with WTMAD-2 (kcal/mol) for all GMTKN55 and its categories for selected functionals. (See Ref. [21] for the detailed description of the categories). Geometries are from the HB375 dataset described in Ref. [29]. The x-axis show the scaled distance factor, f , which is used to make a translated vector t, t = v |v| (f − 1)r ref where v is the bond direction vector and r ref is the distance between the hydrogen and the electron-donor atom. Detailed information about the x-axis can be found in Ref. [29].